About the project

This is the very famous course Introduction to Open Data Science, organised at the University of Helsinki in autumn 2017. My personal repository is here.

I am working with R version 3.2.5 (2016-04-14, nicknamed Very, Very Secure Dishes) and currently RStudio version 1.0.136. I found it helpful to enlarge the GUI a bit and change the theme to Merbivore: go to Tools > Global Options > Appearance and adjust Zoom and Editor theme there. The R Markdown theme journal is used in this course diary; run ?html_dependency_bootstrap() to see the list of themes. The font size had to be made larger by adding these lines to index.Rmd:

<style type="text/css">

body{ /* Normal */
font-size: 20px;
}

</style>


RStudio Exercise 2: Regression and Model Validation

Data Wrangling and General Remarks

First I did some data wrangling, i.e. formatted a dataset so that it is suitable for later analysis, in this case regression analysis. The wrangling was done in an R script file create_learning2014.R, and the formatted dataset is in the file learning2014.csv. As described in the wrangling file, the data was originally collected for an international survey of Approaches to Learning. For details and links to further information, have a look at create_learning2014.R in my repository.

After practising and doing part of the exercise, the RStudio workspace became a bit cluttered. I found that the function remove(...) (alias rm()) can be used to remove objects from memory. It is handy for cleaning up the environment once in a while.
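A minimal sketch of this, using throwaway objects invented just for the illustration:

```r
# Create a couple of throwaway objects, then remove them from the workspace
tmp_a <- 1:10
tmp_b <- "scratch"
remove(tmp_a, tmp_b)   # identical to rm(tmp_a, tmp_b)
exists("tmp_a")        # FALSE: the object is gone
# rm(list = ls()) would clear the entire environment at once
```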

Another important thing is that the extra libraries have to be loaded every time RStudio is started, even though .RData and .Rhistory restore the objects and command history from the previous session.

Analysis

Let’s first read the data set into memory and have a look at it. A summary of the variables is below:

#getwd()
#NOTE: some functions are commented out to keep the knitted output simple.
learning2014 <- read.csv("data/learning2014.csv", header = TRUE)
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

It consists of 166 observations of 7 variables. The variables deep (deep learning), stra (strategic learning) and surf (surface learning) represent three dimensions of approaches to learning. They were calculated in the earlier data wrangling as means of several questionnaire items measured on a Likert scale.

The age distribution is noticeably right-skewed: the long upper tail above the highest quartile shifts the mean away from the median towards the older end of the distribution. This can be seen clearly in the histogram below:

library(ggplot2)
p_age <- ggplot(learning2014, aes(age)) + geom_histogram(binwidth = 1, fill="purple", colour="red") + ggtitle("Histogram of Students' Ages")
p_age

Scatter plots of all the variables, using gender as a factor, are presented in the matrix below. It is included mainly to show how much better plots the GGally and ggplot2 libraries can produce; those plots follow after this first picture:

pairs(learning2014[-1], col=learning2014$gender)
**In this scatter plot matrix black dots mark females and red dots mark males**


GGally and ggplot2 produce another, more informative, scatter plot matrix:

library(ggplot2)
library(GGally)
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
**In this scatter plot matrix red marks females and blue marks males**


  • It can be seen from the matrix above that there are considerably more females than males.
  • It could be analysed whether there is a statistically significant correlation between gender and attitude (note the tail, or tiny peak, of males with low attitude).
  • Gender probably does not statistically explain whether one has adopted surface learning as a strategy: the means are about the same between genders, although the distributions look somewhat different because the variances differ. The variability is higher among males.
  • The correlation between attitude and exam points should also be checked. An alternative hypothesis could be: higher attitude results in higher points.
  • There could be a statistically significant negative correlation between deep and surf, which is quite obvious according to common sense. There is also some difference in the correlation coefficients between genders, which is probably explained by a single female outlier. According to the literature, the correlation coefficient is very sensitive to outliers, i.e. abnormal/unexpected values.
  • It is interesting that there seems to be no statistically significant correlation between deep learning and exam points, especially among females (note: a possible female outlier could explain the low correlation coefficient). This relationship needs more research.
  • I cannot see any non-linear relationships. NOTE: it is always important to look at both the scatter plot and the correlation coefficient, as a non-linear relationship may be visible in the scatter plot even when the correlation coefficient is low. A low correlation coefficient can also result from sensitivity to outliers.
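The attitude vs. points hypothesis above could be tested with cor.test(); a minimal sketch on synthetic stand-in data (with the real data the call would simply be cor.test(learning2014$attitude, learning2014$points)):

```r
# Synthetic stand-in for two positively related variables,
# invented only to illustrate the cor.test() call
set.seed(1)
attitude_demo <- rnorm(50, mean = 31, sd = 7)
points_demo <- 12 + 0.35 * attitude_demo + rnorm(50, sd = 5)
ct <- cor.test(attitude_demo, points_demo)  # Pearson correlation by default
ct$estimate   # sample correlation coefficient
ct$p.value    # p-value for the test of zero correlation
```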

Linear regression model

For the following regression analysis, exam points will be the target (aka dependent, response) variable, and let’s choose three independent (aka explanatory) variables:

  1. attitude towards data science
  2. use of deep learning strategies
  3. use of surface learning strategies.

Here is a regression model with multiple explanatory variables:

my_model <- lm(points ~ attitude + surf + deep, data = learning2014)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + surf + deep, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9168  -3.1487   0.3667   3.8326  11.3519 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.35507    4.71237   3.895 0.000143 ***
## attitude     0.34661    0.05766   6.011 1.18e-08 ***
## surf        -1.09114    0.83599  -1.305 0.193669    
## deep        -0.94846    0.79025  -1.200 0.231815    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1876 
## F-statistic:  13.7 on 3 and 162 DF,  p-value: 5.217e-08

From the summary above it can be seen that only attitude has a statistically significant relationship with exam points (p=1.18e-08, so the risk that this conclusion is wrong is extremely low). The standard error (0.05766) of the regression coefficient for attitude is not an order of magnitude smaller than the coefficient itself (0.34661), indicating some variability in the estimate. The multiple R-squared is about 0.20, meaning that only about 20 % of the variability in exam points is explained by these three variables. The F-statistic and its p-value indicate whether the model would be better with fewer variables; in that case the p-value would be higher. Now the p-value is 5.217e-08. Removing surf from the model gives a slightly lower p-value of 2.326e-08, and removing deep as well gives an even lower p-value of 4.119e-09.
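The "fewer variables" comparison can also be made directly with an F-test between nested models via anova(); a sketch on synthetic data (with the course data this would be anova(lm(points ~ attitude, data = learning2014), my_model)):

```r
# Synthetic example: x2 carries no signal, so the F-test should not
# prefer the fuller model
set.seed(42)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 2 * d$x1 + rnorm(100)
full    <- lm(y ~ x1 + x2, data = d)
reduced <- lm(y ~ x1, data = d)
anova(reduced, full)   # a high Pr(>F) means dropping x2 loses nothing
```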

After testing the other variables as well and checking the p-values from the F-tests, it can be concluded that the best model has only attitude as an explanatory variable. According to the model below, a student has to raise the attitude score by about three units in order to increase the exam points by one. The intercept suggests that even a student with zero attitude would get about 12 exam points (from the simplified model below), which is quite reasonable.

my_best_model <- lm(points ~ attitude, data = learning2014)
summary(my_best_model)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
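The "three units of attitude per exam point" claim follows directly from the reported slope; a quick check using the coefficients printed above:

```r
# Coefficients copied from the summary of my_best_model above
b0 <- 11.63715   # intercept
b1 <- 0.35255    # slope for attitude
# Attitude change needed for one extra exam point:
1 / b1                      # about 2.8
# Predicted points at a few illustrative attitude values:
b0 + b1 * c(20, 30, 40)
```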

Model validation

Let’s do some graphical model validation using the following plots: Residuals vs Fitted values, Normal Q-Q plot and Residuals vs Leverage.

par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))
**A model *my_model* explaining exam points with attitude towards data science, use of deep learning strategies and use of surface learning strategies**


par(mfrow = c(2,2))
plot(my_best_model, which = c(1,2,5))
**A model *my_best_model* explaining exam points with attitude towards data science**


As the model is a linear regression, it is assumed that the relationship between attitude and exam points is linear. From the scatter plots further above it can be seen that this is not exactly the case: there are some observations where the exam points are exceptionally low regardless of attitude. One can imagine that these observations reflect some problems — other than attitude problems — in gathering exam points.

  1. Residuals vs Fitted values can be used to check the model assumption that the residuals (ε, the sizes of the errors) do not depend on the explanatory variables but only on random variability (in other words, that the residuals have a constant variance \(σ^{2}\)). In the plot the dots should be randomly distributed; any pattern indicates a problem.
    1. In these cases some tails can be seen, as well as a group of outliers near the x-axis.
  2. The Normal Q-Q plot can be used to check whether the residuals are normally distributed. If they are, the dots should be aligned on the line.
    1. In these cases some tails can be seen.
  3. Residuals vs Leverage can be used to assess how large an effect a single observation has on the model. In the plot, a large x value means an exceptionally large effect and suggests that the observation might be an outlier. (Note: if an exceptional value of the target variable (y-axis) occurs near the mean of the explanatory variables, the observation does not have a large leverage effect on the regression line.)
    1. In the first model there is clearly a single exceptional value.
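The leverage values behind the Residuals vs Leverage plot can also be inspected numerically with hatvalues(); a sketch on synthetic data where one observation has an extreme x value:

```r
# Synthetic fit: observation 21 is extreme in x, so it should have
# the largest leverage
set.seed(3)
d <- data.frame(x = c(rnorm(20), 10))
d$y <- d$x + rnorm(21)
fit <- lm(y ~ x, data = d)
h <- hatvalues(fit)
which.max(h)   # observation 21 has the largest leverage
```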

RStudio exercise 3: Logistic Regression

Data Wrangling

For details and links to further information about the data have a look at the wrangling script create_alc.R at my repository. The wrangled data is here.

Analysis

The data was gathered among students attending mathematics and Portuguese language courses in two Portuguese schools. The variables are explained here at the source of the data under the title Attribute information. Originally the data was gathered to study the prediction of student performance in secondary education. In this analysis the focus is on student alcohol consumption. Let’s first read the data set into memory and have a look at the variable names and summaries of the variables.

#getwd()
#NOTE: some functions are commented out to keep the knitted output simple.
library(dplyr)
alc <- read.csv("data/student_alc_analysis.csv", header = TRUE)
#colnames(alc)
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school     <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex        <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize    <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus    <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob       <fctr> teacher, other, other, services, other, other, oth...
## $ reason     <fctr> course, course, other, home, home, reputation, hom...
## $ nursery    <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet   <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian   <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup     <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid       <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher     <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic   <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
summary(alc)
##  school   sex          age        address famsize   Pstatus
##  GP:342   F:198   Min.   :15.00   R: 81   GT3:278   A: 38  
##  MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104   T:344  
##                   Median :17.00                            
##                   Mean   :16.59                            
##                   3rd Qu.:17.00                            
##                   Max.   :22.00                            
##       Medu            Fedu             Mjob           Fjob    
##  Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##  Median :3.000   Median :3.000   other   :138   other   :211  
##  Mean   :2.806   Mean   :2.565   services: 96   services:107  
##  3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##  Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.448  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.037   Mean   :0.2016                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel         freetime        goout      
##  no : 18   no :261   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:2.000  
##                      Median :4.000   Median :3.00   Median :3.000  
##                      Mean   :3.937   Mean   :3.22   Mean   :3.113  
##                      3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:4.000  
##                      Max.   :5.000   Max.   :5.00   Max.   :5.000  
##       Dalc            Walc           health         absences   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 0.0  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 1.0  
##  Median :1.000   Median :2.000   Median :4.000   Median : 3.0  
##  Mean   :1.482   Mean   :2.296   Mean   :3.573   Mean   : 4.5  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 6.0  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :45.0  
##        G1              G2              G3           alc_use     
##  Min.   : 2.00   Min.   : 4.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.000  
##  Median :12.00   Median :12.00   Median :12.00   Median :1.500  
##  Mean   :11.49   Mean   :11.47   Mean   :11.46   Mean   :1.889  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :5.000  
##   high_use      
##  Mode :logical  
##  FALSE:268      
##  TRUE :114      
##  NA's :0        
##                 
## 

Here is a graphical summary of the variables (I would call it a kind of bar-chart histogram of each variable), although it is not perfect: the x-axes of the variables absences, G1, G2 and G3 are not in the correct order. They appear in alphanumerical order (1, 10, 11, 12, …, 19, 2, 20, 21, …) instead of numerical order. Tip: right-click the image of the plots and choose “Show image” to zoom in.

library(tidyr)
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", ncol = 3, scales = "free") + geom_bar()
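The odd axis order seems to come from gather() stacking columns of mixed types into one character-valued column, so the values sort alphanumerically; a minimal base-R illustration:

```r
# Numbers stored as character strings sort lexicographically
vals <- as.character(c(1, 2, 10, 11, 20))
sort(vals)              # "1" "10" "11" "2" "20" — alphanumeric order
sort(as.numeric(vals))  # 1 2 10 11 20 — numeric order
```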

Hypothesis based on the data below: being male increases the risk of adopting high alcohol consumption habits.

alc %>% group_by(high_use, sex) %>% summarise(count = n(), mean_grade = mean(G3))
## Source: local data frame [4 x 4]
## Groups: high_use [?]
## 
##   high_use    sex count mean_grade
##      <lgl> <fctr> <int>      <dbl>
## 1    FALSE      F   156   11.39744
## 2    FALSE      M   112   12.20536
## 3     TRUE      F    42   11.71429
## 4     TRUE      M    72   10.27778
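The counts printed above can be turned into within-sex shares of high users; the numbers below are copied from the table above (with the data loaded, table(alc$sex, alc$high_use) gives the same counts):

```r
# Counts of high_use by sex, copied from the grouped summary above
counts <- matrix(c(156, 112, 42, 72), nrow = 2,
                 dimnames = list(sex = c("F", "M"),
                                 high_use = c("FALSE", "TRUE")))
prop.table(counts, margin = 1)   # row-wise (within-sex) proportions
# About 21 % of females vs. 39 % of males are high users,
# in line with the hypothesis above.
```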

Hypothesis based on the data below: living in an urban area decreases alcohol consumption.

alc %>% group_by(high_use, address) %>% summarise(count = n(), mean_grade = mean(G3))
## Source: local data frame [4 x 4]
## Groups: high_use [?]
## 
##   high_use address count mean_grade
##      <lgl>  <fctr> <int>      <dbl>
## 1    FALSE       R    50   10.40000
## 2    FALSE       U   218   12.04128
## 3     TRUE       R    31   10.74194
## 4     TRUE       U    83   10.83133

Hypothesis based on the data below: Getting educational support from family decreases alcohol consumption.

alc %>% group_by(high_use, famsup) %>% summarise(count = n(), mean_grade = mean(G3))
## Source: local data frame [4 x 4]
## Groups: high_use [?]
## 
##   high_use famsup count mean_grade
##      <lgl> <fctr> <int>      <dbl>
## 1    FALSE     no    98   11.85714
## 2    FALSE    yes   170   11.66471
## 3     TRUE     no    46   10.47826
## 4     TRUE    yes    68   11.02941

Hypothesis based on the data below: attending paid extra classes decreases alcohol consumption.

alc %>% group_by(high_use, paid) %>% summarise(count = n(), mean_grade = mean(G3))
## Source: local data frame [4 x 4]
## Groups: high_use [?]
## 
##   high_use   paid count mean_grade
##      <lgl> <fctr> <int>      <dbl>
## 1    FALSE     no   148   11.26351
## 2    FALSE    yes   120   12.31667
## 3     TRUE     no    57   10.66667
## 4     TRUE    yes    57   10.94737

To be continued…


RStudio exercise 4: Clustering and classification

Analysis

#NOTE: some functions are commented out to keep the knitted output simple.
library(MASS)
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The data used in this exercise is called Boston and comes from the R library MASS. The data frame is about housing values in the suburbs of Boston; it has 506 observations of 14 variables, which are explained on this webpage. In short, the variables concern crime rate, urban planning, housing, air pollution, economy, school environment and skin colour. The variables chas (Charles River dummy variable: 1 if the tract bounds the river, 0 otherwise) and rad (index of accessibility to radial highways) are integers; the others are numeric.

Let’s have a look at the data. A summary of the variables is below:

library(GGally)
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Scatter plots of all the variables are presented in a matrix below. Tip: Right click the image of plots and choose “Show image” to zoom closer.

library(ggplot2)
p41 <- ggpairs(Boston, lower = list(combo = wrap("facethist", bins = 20)))
p41

Most of the variables are far from normally distributed. For example, the per capita crime rate is extremely skewed, as a couple of suburbs have extreme values. This can be seen clearly in the histogram below:

p4_crime <- ggplot(Boston, aes(crim)) + geom_histogram(binwidth = 1, fill="purple", colour="red") + ggtitle("Histogram of per capita crime rate by town")
p4_crime

korrelaatiot <- cor(Boston)

Let’s have a look again at the scatter plots:

  • distributions: There are many phenomena with a clear non-linear relationship (e.g. nox vs. dis, meaning air pollution dilutes non-linearly when moving away from pollution sources). The variable black gets smaller values as the proportion of black residents rises: the mean proportion is about 3.3 % and the median proportion 0.43 %.
  • histograms: Only rm (average number of rooms per dwelling) is close to normally distributed. indus gives the impression of two industrial areas.
  • correlation coefficients: High correlations (> 0.5 in absolute value; bold font if > 0.7; + indicates a positive and - a negative correlation) are found between the following variables:
    • crim+rad, crim+tax,
    • zn+dis, zn-age, zn-indus, zn-nox,
    • indus+nox, indus+tax, indus+age, indus+lstat, indus+rad, indus-dis,
    • nox+age, nox+tax, nox+rad, nox+lstat, nox-dis,
    • rm+medv, rm-lstat
    • age+lstat, age+tax, age-dis,
    • dis-tax,
    • rad+tax,
    • tax+lstat,
    • ptratio-medv
    • lstat-medv.
  • Okay, let’s think: what was the point of compiling this list manually after creating the object with korrelaatiot <- cor(Boston) and using RStudio’s viewer (View(korrelaatiot))? Then again, the most interesting correlations are not always the ones with the highest coefficients.

So, obviously, it would have been a lot easier to use the corrplot package and the function:

#corrplot(korrelaatiot, type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6, method="circle")

Unfortunately, the corrplot package is not available for the R version 3.2.5 I am using.
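One base-R fallback (no extra packages needed) would be image() on the correlation matrix; this is only a sketch on a small synthetic matrix, but korrelaatiot from above could be passed in the same way:

```r
# Synthetic 4x4 correlation matrix, invented only for the illustration
set.seed(7)
m <- cor(matrix(rnorm(200), ncol = 4))
# image() draws a colour grid; transposing and reversing rows keeps
# the usual matrix orientation
image(t(m)[, ncol(m):1], axes = FALSE, main = "Correlation matrix")
```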

Let’s standardize (scale and center) the dataset:

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Now the values are centered around their means and scaled by dividing by the standard deviation, which makes the values proportional to the variability.
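A minimal check that scale() really computes (x − mean(x)) / sd(x) for each column:

```r
# Tiny vector invented just for the check
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
scaled <- as.numeric(scale(x))
all.equal(manual, scaled)   # TRUE
```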

#boston_scaled is a matrix object, as class(boston_scaled) would tell. Let's change it to a data.frame:
boston_scaled <- as.data.frame(boston_scaled)

In LDA the target variable has to be categorical, so let’s create a categorical crime rate variable (crime) by cutting the original variable at its quantiles to get low, middle and high rates of crime:

scaled_crim <- boston_scaled$crim
bins <- quantile(scaled_crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
crime <- cut(scaled_crim, breaks = bins, label = c("low", "med_low", "med_high", "high"),  include.lowest = TRUE)
#the following shows the table of the new factor crime:
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
#remove original crim from the dataset (note that here "select" is used from a package dplyr)
#str(boston_scaled)
boston_scaled <- dplyr::select(boston_scaled, -crim)
#then let's insert the new variable to data.frame:
boston_scaled <- data.frame(boston_scaled, crime)

The standardized dataset is then randomly split into a training set (containing 80 % of the observations) and a test set (20 %).

n <- nrow(boston_scaled)
# choose randomly 80% of the n rows and save those rows to ind
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Linear Discriminant Analysis (LDA)

Below, a linear discriminant analysis is fitted on the training set. crime is the target variable and all the other variables are predictors:

lda.fit <- lda(crime ~ ., data = train)
#The dot above means "all the other variables"
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2623762 0.2103960 0.2623762 0.2648515 
## 
## Group means:
##                  zn      indus        chas        nox          rm
## low       1.0330779 -0.9351223 -0.12375925 -0.8641087  0.47350937
## med_low  -0.1780204 -0.2898088 -0.04073494 -0.5507918 -0.08783836
## med_high -0.3974409  0.1840786  0.17338039  0.4026956  0.13785058
## high     -0.4872402  1.0170108 -0.01476176  1.0611923 -0.45675530
##                 age        dis        rad        tax     ptratio
## low      -0.8687437  0.9052106 -0.6785025 -0.7330079 -0.43003636
## med_low  -0.2904168  0.2410603 -0.5603162 -0.4899712 -0.07890624
## med_high  0.4471043 -0.3999727 -0.4119715 -0.3097766 -0.29233626
## high      0.8186070 -0.8613334  1.6392096  1.5148289  0.78203563
##               black       lstat        medv
## low       0.3787368 -0.78792882  0.52685014
## med_low   0.3116427 -0.14369321 -0.00548579
## med_high  0.0607768  0.02872742  0.17337546
## high     -0.8704036  0.87844244 -0.67149365
## 
## Coefficients of linear discriminants:
##                  LD1          LD2         LD3
## zn       0.080288394  0.798401848 -0.85574427
## indus    0.009085977 -0.331330209  0.20879156
## chas    -0.085044182 -0.072147258  0.05647243
## nox      0.392363306 -0.450558433 -1.50467748
## rm      -0.119665428 -0.112323335 -0.12245063
## age      0.291589184 -0.231291651 -0.28058919
## dis     -0.021069292 -0.133776901 -0.13123075
## rad      3.115211027  0.978498140  0.01651773
## tax      0.067120726 -0.089412330  0.58548401
## ptratio  0.141851916  0.063398863 -0.33072590
## black   -0.130331524  0.009493206  0.13281074
## lstat    0.222164483 -0.331634692  0.27679339
## medv     0.211957177 -0.348748779 -0.30279330
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9450 0.0423 0.0127

The last three rows of the LDA output show that linear discriminant 1 (LD1; the linear combination of the predictor variables that best separates the target classes) explains about 95 % of the between-group variance in the crime rate categories (per capita crime rate by town).
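The proportion of trace can also be computed by hand from the singular values stored in the fitted object (the svd component of the list that MASS::lda returns). A small self-contained sketch with the built-in iris data:

```r
library(MASS)
# fit LDA on iris: Species as target, all other variables as predictors
fit <- lda(Species ~ ., data = iris)
# squared singular values give each discriminant's share of the between-group variance
prop <- fit$svd^2 / sum(fit$svd^2)
round(prop, 4)  # matches the "Proportion of trace" line printed by fit
```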

Below is the LDA biplot for the train set:

#To draw a LDA biplot, the following has to be done:
#Numeric crime classes:
train_classes <- as.numeric(train$crime)
# the biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
#The biplot:
plot(lda.fit, dimen = 2, col = train_classes, pch = train_classes)
lda.arrows(lda.fit, myscale = 2)

From the plot above, it looks like rad ("index of accessibility to radial highways") is strongly related to the high crime rate category (its correlation with crim, as read from the scatter plot further above, is 0.626). This is because the angle between the rad arrow and the LD1 axis is small. The rad arrow is also clearly the longest, which means that rad has by far the largest discriminant coefficient on LD1 (3.12), so it dominates the first discriminant.
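The correlation mentioned above can be checked directly from the original (unscaled) Boston data in the MASS package:

```r
library(MASS)
data("Boston")
# correlation between accessibility to radial highways and per capita crime rate
cor(Boston$rad, Boston$crim)  # about 0.626
```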

Next, let's do some model validation using the test dataset, i.e. check how accurately the fitted LDA model predicts the crime classes. Help is available in R via ?predict.lda.

#As crime is the target variable, so let's remove that one from the test dataset.
#First save the correct crime classes from test data
test_classes <- test$crime
#then remove the crime variable from test data
test <- dplyr::select(test, -crime)
#str(test)
#Note: the model is the first argument in the predict function. 
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = test_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       10      10        1    0
##   med_low   12      21        8    0
##   med_high   1       5       13    1
##   high       0       0        0   20

The cross tabulation above shows that the learned LDA model (lda.fit) performs well mainly in the high category (all 20 observations predicted correctly), whereas the low, med_low and med_high categories get substantial proportions of both right and wrong predictions.
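A single accuracy number summarises the table. Below, the counts from the cross tabulation above are re-entered by hand (since the split is random, the exact numbers vary between runs):

```r
# confusion matrix from the cross tabulation above (rows = correct class)
tab <- matrix(c(10, 10,  1,  0,
                12, 21,  8,  0,
                 1,  5, 13,  1,
                 0,  0,  0, 20),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(tab)) / sum(tab)  # correct predictions / all predictions
round(accuracy, 3)  # 0.627, i.e. about 63 % of the test towns classified correctly
```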

To be continued: Clustering…

Data Wrangling

…in an R script file create_human.R. To be published on GitHub.


Analysis

To be published on 10th Dec.
